See also
Open this notebook in Google Colab, a free cloud-based environment that allows you to run Jupyter notebooks, or download it from GitHub to run it on your local computer.
Extracting Patents that cite Publications from a chosen Research Organization¶
This tutorial shows how to extract and analyse patent information linked to a selected research organization, using the Dimensions Analytics API.
Load libraries and log in¶
[2]:
# @markdown # Get the API library and login
# @markdown **Privacy tip**: leave the password blank and you'll be asked for it later. This can be handy on shared computers.
username = "" #@param {type: "string"}
password = "" #@param {type: "string"}
endpoint = "https://app.dimensions.ai" #@param {type: "string"}
# import all libraries and login
!pip install dimcli plotly_express -U --quiet
import dimcli
from dimcli.shortcuts import *
dimcli.login(username, password, endpoint)
dsl = dimcli.Dsl()
#
import os
import sys
import time
import json
import pandas as pd
from pandas.io.json import json_normalize
from tqdm import tqdm_notebook as progressbar
#
# charts lib
import plotly_express as px
if 'google.colab' not in sys.modules:
    # make js dependencies local / needed by html exports
    from plotly.offline import init_notebook_mode
    init_notebook_mode(connected=True)
DimCli v0.6.1.2 - Succesfully connected to <https://app.dimensions.ai> (method: dsl.ini file)
A couple of utility functions to simplify exporting CSV files to a selected folder
[3]:
#
# data-saving utils
#
DATAFOLDER = "extraction1"
#
if not os.path.exists(DATAFOLDER):
    !mkdir $DATAFOLDER
    print("==\nCreated data folder:", DATAFOLDER + "/")
#
def save_as_csv(df, save_name_without_extension):
    "usage: `save_as_csv(dataframe, 'filename')`"
    df.to_csv(f"{DATAFOLDER}/{save_name_without_extension}.csv", index=False)
    print("===\nSaved: ", f"{DATAFOLDER}/{save_name_without_extension}.csv")
Choose a GRID Research Organization¶
For the purpose of this exercise, we are going to use grid.89170.37. Feel free to change the parameters below as you wish, e.g. by choosing another GRID organization.
[2]:
GRIDID = "grid.89170.37" #@param {type:"string"}
#@markdown The start/end year of publications used to extract patents
YEAR_START = 2000 #@param {type: "slider", min: 1950, max: 2020}
YEAR_END = 2016 #@param {type: "slider", min: 1950, max: 2020}
if YEAR_END < YEAR_START:
    YEAR_END = YEAR_START
#
# gen link to Dimensions
#
def dimensions_url(grids):
    root = "https://app.dimensions.ai/discover/publication?or_facet_research_org="
    return root + "&or_facet_research_org=".join(grids)
from IPython.core.display import display, HTML
display(HTML('---<br /><a href="{}">Open in Dimensions ⧉</a>'.format(dimensions_url([GRIDID]))))
#@markdown ---
1 - Prerequisite: Extracting Publications Data¶
By looking at the Dimensions API data model, we can see that the connection between Patents and Publications is represented via a directed arrow going from Patents to Publications: that means we should look for patent records whose publication_ids field contains references to the GRID publications we are interested in.
Hence, we need to:
a) extract all publications linked to one (or more) GRID IDs, and
b) use these publications to extract patents referencing them.
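As a sketch, the two DSL query strings for steps a) and b) can be assembled like this (the values are placeholders; the real cells below build them with the notebook's parameters and run them via dimcli's query_iterative):

```python
import json

# Hypothetical parameter values, matching the ones set earlier in the notebook
GRID_ID, YEAR_START, YEAR_END = "grid.89170.37", 2000, 2016

# Step a): publications linked to the organization in the chosen time frame
pubs_query = (f'search publications where research_orgs.id="{GRID_ID}" '
              f'and year in [{YEAR_START}:{YEAR_END}] return publications')

# Step b): patents whose publication_ids reference those publications
pub_ids = ["pub.1", "pub.2"]  # toy IDs standing in for the real results
patents_query = (f"search patents where publication_ids in {json.dumps(pub_ids)} "
                 "return patents")
print(patents_query)
```

Serializing the ID list with json.dumps produces the quoted, bracketed list syntax the DSL expects.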
[3]:
# Get full list of publications linked to this organization for the selected time frame
q = f"""search publications
where research_orgs.id="{GRIDID}"
and year in [{YEAR_START}:{YEAR_END}]
return publications[basics+category_for+times_cited]"""
pubs_json = dsl.query_iterative(q)
pubs = pubs_json.as_dataframe()
# save the data
save_as_csv(pubs, f"pubs_{GRIDID}")
1000 / 17126
2000 / 17126
3000 / 17126
4000 / 17126
5000 / 17126
6000 / 17126
7000 / 17126
8000 / 17126
9000 / 17126
10000 / 17126
11000 / 17126
12000 / 17126
13000 / 17126
14000 / 17126
15000 / 17126
16000 / 17126
17000 / 17126
17126 / 17126
===
Saved: extraction1/pubs_grid.89170.37.csv
How many publications per year?¶
Let’s have a quick look at the publication volume per year.
[4]:
px.histogram(pubs, x="year", y="id", color="type", barmode="group", title=f"Publication by year from {GRIDID}")
What are the main subject areas?¶
We can use the Field of Research categories information in publications to obtain a breakdown of the publications by subject areas.
This can be achieved by ‘exploding’ the category_for data into a separate table, since there can be more than one category per publication. The new categories table also retains some basic information about the publication it relates to (e.g. journal, title, publication ID, year), so as to make the data easier to analyse.
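To see what json_normalize does here, a minimal self-contained example with toy records (the notebook imports it from pandas.io.json; newer pandas exposes the same function as pd.json_normalize):

```python
import pandas as pd

# Toy records shaped like the publications data: a nested list of FOR
# categories plus some flat metadata per publication.
records = [
    {"id": "pub.1", "year": 2005,
     "category_for": [{"id": "01", "name": "01 Mathematical Sciences"},
                      {"id": "02", "name": "02 Physical Sciences"}]},
    {"id": "pub.2", "year": 2006,
     "category_for": [{"id": "02", "name": "02 Physical Sciences"}]},
]

# One row per (publication, category) pair; meta fields are repeated per row
subjects = pd.json_normalize(records, record_path=["category_for"],
                             meta=["id", "year"], record_prefix="for_")
print(len(subjects))  # 3 rows: pub.1 appears twice, pub.2 once
```

Note how record_prefix renames the nested fields (for_id, for_name) so they don't clash with the publication-level id column carried over via meta.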
[5]:
# ensure key exists in all rows (even if empty)
normalize_key("category_for", pubs_json.publications)
normalize_key("journal", pubs_json.publications)
# explode subjects into separate table
pubs_subjects = json_normalize(pubs_json.publications, record_path=['category_for'],
meta=["id", "type", ["journal", "title"], "year"],
errors='ignore', record_prefix='for_')
# add a new column: category name without digits for better readability
pubs_subjects['topic'] = pubs_subjects['for_name'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
Now we can build a scatter plot that shows the number and distribution of categories over the years.
[6]:
px.scatter(pubs_subjects, x="year", y="topic", color="type",
hover_name="for_name",
height=800,
marginal_x="histogram", marginal_y="histogram",
title=f"Top publication subjects for {GRIDID} (marginal subplots = X/Y totals)")
2 - Extracting Patents linked to Publications¶
In this section we extract all patents linked to the publications dataset previously created. The steps are the following:
we loop over the publication IDs and create patents queries, via the referencing publication_ids field of patents
we collate all patents data, remove duplicates and save the results
finally, we count patents per publication and enrich the original publication dataset with these numbers
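Batching matters because a single DSL filter cannot list thousands of IDs at once. The chunking idea (what dimcli.shortcuts provides as chunks_of) can be sketched as:

```python
def chunks_of(data, size):
    """Yield consecutive slices of at most `size` items from a list."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

# Toy ID list: 10 items split into batches of 4
ids = [f"pub.{n}" for n in range(10)]
batches = list(chunks_of(ids, 4))
print([len(b) for b in batches])  # [4, 4, 2]
```

This sketch only illustrates the behaviour; the code below uses the helper imported from dimcli.shortcuts.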
[7]:
#
# the main query
#
q = """search patents where publication_ids in {}
return patents[basics+publication_ids+FOR]"""
#
# useful libraries for looping
#
from dimcli.shortcuts import chunks_of
from tqdm import tqdm_notebook as progressbar
#
# let's loop through all publication IDs in chunks and query Dimensions
#
print("===\nExtracting patents data ...")
patents_json = []
BATCHSIZE = 400
VERBOSE = False # set to True to see patents extraction logs
pubsids = pubs['id']
for chunk in progressbar(list(chunks_of(list(pubsids), BATCHSIZE))):
    data = dsl.query_iterative(q.format(json.dumps(chunk)), verbose=VERBOSE)
    patents_json += data.patents
    time.sleep(1)
#
# put the patents data into a dataframe, remove duplicates and save
#
patents = pd.DataFrame.from_dict(patents_json)
print("Patents found: ", len(patents))
patents.drop_duplicates(subset='id', inplace=True)
print("Unique Patents found: ", len(patents))
# save
save_as_csv(patents, f"patents_{GRIDID}")
# turning lists into strings to ensure compatibility with CSV loaded data
# see also: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list
patents['publication_ids'] = patents['publication_ids'].apply(lambda x: ','.join(map(str, x)))
#
# count patents per publication and enrich the original dataset
#
def patents_per_pub(pubid):
    "Return the patents whose publication_ids list contains this exact ID."
    # exact membership check avoids counting IDs that merely share a prefix
    return patents[patents['publication_ids'].apply(lambda s: pubid in s.split(','))]
print("===\nCounting patents per publication...")
l = []
for x in progressbar(pubsids):
    l.append(len(patents_per_pub(x)))
#
# enrich and save the publications data
#
pubs['patents'] = l
save_as_csv(pubs, f"pubs_{GRIDID}_enriched_patents")
===
Extracting patents data ...
Patents found: 4677
Unique Patents found: 3966
===
Saved: extraction1/patents_grid.89170.37.csv
===
Counting patents per publication...
===
Saved: extraction1/pubs_grid.89170.37_enriched_patents.csv
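Since publication_ids was flattened to a comma-joined string before saving, reloading the CSV later requires splitting it back into lists. A toy sketch, using an in-memory frame in place of the real pd.read_csv call:

```python
import pandas as pd

# Toy frame standing in for the reloaded patents CSV
patents_demo = pd.DataFrame({"id": ["US-1"],
                             "publication_ids": ["pub.1,pub.2"]})

# Split each comma-joined string back into a Python list per row
patents_demo["publication_ids"] = patents_demo["publication_ids"].str.split(",")
print(patents_demo["publication_ids"][0])  # ['pub.1', 'pub.2']
```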
A quick look at the data¶
[8]:
# display top 3 rows
patents.head(3)
[8]:
| FOR | publication_ids | assignees | publication_date | year | granted_year | title | times_cited | id | assignee_names | filing_status | inventor_names | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | None | pub.1038200468,pub.1067588797,pub.1006473292,p... | [{'id': 'grid.8767.e', 'city_name': 'Brussels'... | 2017-09-14 | 2017 | NaN | CD20 BINDING AGENTS AND USES THEREOF | 0 | WO-2017153345-A1 | [VIB VZW, UNIV GENT, UNIV BRUSSEL VRIJE] | Application | [TAVERNIER, JAN, VAN DER HEYDEN, José, DEVOOGD... |
| 1 | [{'id': '2921', 'name': '0912 Materials Engine... | pub.1021290825,pub.1034244510,pub.1022541167 | [{'id': 'grid.454156.7', 'city_name': 'Hsinchu... | 2016-08-18 | 2015 | NaN | Semiconductor Devices Comprising 2D-Materials ... | 11 | US-20160240719-A1 | [Taiwan Semiconductor Manufacturing Co (TSMC) ... | Application | [Meng-Yu Lin, Shih-Yen Lin, Si-Chen Lee, Samue... |
| 2 | [{'id': '2921', 'name': '0912 Materials Engine... | pub.1073831920,pub.1034244510,pub.1021290825,p... | [{'id': 'grid.454156.7', 'city_name': 'Hsinchu... | 2018-01-02 | 2015 | 2018.0 | Semiconductor devices comprising 2D-materials ... | 0 | US-9859115-B2 | [Taiwan Semiconductor Manufacturing Co (TSMC) ... | Grant | [Meng-Yu Lin, Shih-Yen Lin, Si-Chen Lee, Samue... |
Publications now have patents info:
[9]:
pubs.sort_values("patents", ascending=False).head(3)
[9]:
| category_for | times_cited | pages | year | id | issue | volume | author_affiliations | type | title | journal.id | journal.title | journal | patents | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13542 | [{'id': '2921', 'name': '0912 Materials Engine... | 55 | 113-121 | 2003 | pub.1008622444 | 2-3 | 4 | [[{'first_name': 'Leonidas C', 'last_name': 'P... | article | High efficiency molecular organic light-emitti... | jour.1047200 | Organic Electronics | NaN | 237 |
| 16286 | [{'id': '2921', 'name': '0912 Materials Engine... | 5 | 341-348 | 2001 | pub.1061127158 | 2 | 49 | [[{'first_name': 'M.', 'last_name': 'Friedman'... | article | Low-Loss RF Transport Over Long Distances | jour.1123356 | IEEE Transactions on Microwave Theory and Tech... | NaN | 150 |
| 11610 | [{'id': '2202', 'name': '02 Physical Sciences'... | 4177 | 435-446 | 2005 | pub.1012397724 | 6 | 4 | [[{'first_name': 'Igor L.', 'last_name': 'Medi... | article | Quantum dot bioconjugates for imaging, labelli... | jour.1031408 | Nature Materials | NaN | 90 |
3 - Patents Data Analysis¶
Now that we have extracted all the data we need, let’s start exploring them by building a few visualizations.
How many patents per year?¶
[10]:
px.histogram(patents, x="year", y="id", color="filing_status",
barmode="group",
title=f"Patents referencing publications from {GRIDID} - by year")
Who is filing the patents?¶
This can be done by looking at the assignees field of patents. Since the field contains nested information, we first need to extract it into its own table (similarly to what we did above with publication categories).
[11]:
# ensure the key exists in all rows (even if empty)
normalize_key('assignees', patents_json)
# explode assignees into separate table
patents_assignees = json_normalize(patents_json,
record_path=['assignees'],
meta=['id', 'year', 'title'],
meta_prefix="patent_")
[12]:
top_assignees = patents_assignees.groupby(['name', 'country_name'], as_index=False).count().sort_values(by="patent_id", ascending=False)
px.bar(top_assignees, x="name", y="patent_id",
hover_name="name", color="country_name",
title=f"Top Assignees for patents referencing publications from {GRIDID}")
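The groupby/count/sort pattern used above can be checked on toy data: counting rows per (name, country) pair yields one tally per assignee, sorted by patent count.

```python
import pandas as pd

# Toy assignees table: two patents for 'A', one for 'B'
demo = pd.DataFrame({"name": ["A", "A", "B"],
                     "country_name": ["US", "US", "DE"],
                     "patent_id": ["p1", "p2", "p3"]})

top = (demo.groupby(["name", "country_name"], as_index=False)
           .count()
           .sort_values(by="patent_id", ascending=False))
print(top.iloc[0]["name"], top.iloc[0]["patent_id"])  # A 2
```

After count(), the patent_id column holds the number of non-null patent IDs per group, which is what the bar chart plots.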
[13]:
px.scatter(patents_assignees, x="name", y="country_name",
color="patent_year", hover_name="name",
hover_data=["id", "patent_id"], marginal_y="histogram",
title=f"Assignees for patents referencing publications from {GRIDID} - Yearly breakdown")
What are the publications most frequently referenced in patents?¶
[14]:
pubs_cited = pubs.query("patents > 0 ").sort_values('patents', ascending=False).copy()
pubs_cited
[14]:
| category_for | times_cited | pages | year | id | issue | volume | author_affiliations | type | title | journal.id | journal.title | journal | patents | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13542 | [{'id': '2921', 'name': '0912 Materials Engine... | 55 | 113-121 | 2003 | pub.1008622444 | 2-3 | 4 | [[{'first_name': 'Leonidas C', 'last_name': 'P... | article | High efficiency molecular organic light-emitti... | jour.1047200 | Organic Electronics | NaN | 237 |
| 16286 | [{'id': '2921', 'name': '0912 Materials Engine... | 5 | 341-348 | 2001 | pub.1061127158 | 2 | 49 | [[{'first_name': 'M.', 'last_name': 'Friedman'... | article | Low-Loss RF Transport Over Long Distances | jour.1123356 | IEEE Transactions on Microwave Theory and Tech... | NaN | 150 |
| 11610 | [{'id': '2202', 'name': '02 Physical Sciences'... | 4177 | 435-446 | 2005 | pub.1012397724 | 6 | 4 | [[{'first_name': 'Igor L.', 'last_name': 'Medi... | article | Quantum dot bioconjugates for imaging, labelli... | jour.1031408 | Nature Materials | NaN | 90 |
| 14007 | [{'id': '2581', 'name': '0601 Biochemistry and... | 1461 | 47-51 | 2003 | pub.1036950132 | 1 | 21 | [[{'first_name': 'Jyoti K.', 'last_name': 'Jai... | article | Long-term multiple color imaging of live cells... | jour.1115214 | Nature Biotechnology | NaN | 69 |
| 11331 | [{'id': '2921', 'name': '0912 Materials Engine... | 14 | 2207-2214 | 2005 | pub.1061591798 | 10 | 52 | [[{'first_name': 'Haizhou', 'last_name': 'Yin'... | article | Ultrathin Strained-SOI by Stress Balance on Co... | jour.1019990 | IEEE Transactions on Electron Devices | NaN | 67 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10248 | [{'id': '2401', 'name': '0204 Condensed Matter... | 34 | 2817-2823 | 2006 | pub.1006364633 | 11 | 21 | [[{'first_name': 'Chulho', 'last_name': 'Song'... | article | Optical enzymatic detection of glucose based o... | jour.1357547 | Journal of Materials Research | NaN | 1 |
| 10258 | [{'id': '2202', 'name': '02 Physical Sciences'... | 14 | 174410 | 2006 | pub.1060619172 | 17 | 74 | [[{'first_name': 'K. L.', 'last_name': 'Sauer'... | article | Spin dynamics in the pulsed spin locking of nu... | jour.1320488 | Physical Review B | NaN | 1 |
| 10290 | [{'id': '2447', 'name': '0303 Macromolecular a... | 43 | 21487-96 | 2006 | pub.1056066440 | 43 | 110 | [[{'first_name': 'Cynthia N', 'last_name': 'Ko... | article | Triarylphosphine-stabilized platinum nanoparti... | jour.1031548 | The Journal of Physical Chemistry B | NaN | 1 |
| 10317 | [{'id': '2208', 'name': '08 Information and Co... | 0 | 31-32 | 2006 | pub.1047540916 | 4 | 10 | [[{'first_name': 'Ian D.', 'last_name': 'Chake... | article | Mobile ad hoc networking and the IETF | jour.1139676 | ACM SIGMOBILE Mobile Computing and Communicati... | NaN | 1 |
| 17106 | [{'id': '2867', 'name': '0906 Electrical and E... | 114 | 1115-1126 | 2000 | pub.1061215656 | 4 | 36 | [[{'first_name': 'M.', 'last_name': 'Steiner',... | article | Fast converging adaptive processor or a struct... | jour.1022254 | IEEE Transactions on Aerospace and Electronic ... | NaN | 1 |
1050 rows × 14 columns
[15]:
px.bar(pubs_cited[:1000], color="type",
x="year", y="patents",
hover_name="title", hover_data=["journal.title"],
title=f"Top Publications from {GRIDID} mentioned in patents, by year of publication")
What are the main subject areas of referenced publications?¶
[16]:
THRESHOLD_PUBS = 1000
citedids = list(pubs_cited[:THRESHOLD_PUBS]['id'])
pubs_subjects_cited = pubs_subjects[pubs_subjects['id'].isin(citedids)]
[17]:
px.scatter(pubs_subjects_cited, x="year", y="topic", color="type",
hover_name="for_name",
height=800,
marginal_x="histogram", marginal_y="histogram",
title=f"Top {THRESHOLD_PUBS} {GRIDID} publications cited by patents - by subject area")
Is there a correlation between publication citations and patent citations?¶
Note: if the points on a scatter plot produce a lower-left-to-upper-right pattern (see below), that is indicative of a positive correlation between the two variables: when the score of one observation is high, we expect the score of the other to be high as well, and vice versa.
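To go beyond the visual check, the strength of the relationship can be quantified with Pearson's r via pandas (toy numbers here, standing in for the real pubs frame):

```python
import pandas as pd

# Toy stand-in for the enriched publications dataframe
demo = pd.DataFrame({"patents": [0, 1, 2, 3, 10],
                     "times_cited": [2, 5, 9, 12, 40]})

# Pearson correlation between patent counts and publication citations
r = demo["patents"].corr(demo["times_cited"])
print(f"r = {r:.2f}")  # close to 1 indicates a strong positive correlation
```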
[18]:
px.scatter(pubs, x="patents", y="times_cited",
title=f"Citations of {GRIDID} publications from publications VS from patents")
Conclusions¶
In this Dimensions Analytics API tutorial we have seen how, starting from a GRID organization, it is possible to extract:
publications from authors associated to this organization
patents citing those publications (from any organization)
We have also done a basic analysis of the citing patents dataset, using fields like citation year, assignees, etc.
This only scratches the surface of the possible applications of publication-patent linkage data, but hopefully it gives you a few basic tools to get started building your own application!
Note
The Dimensions API Lab contains many notebooks like this one, demonstrating how to carry out useful scholarly analytics tasks using the Analytics API. Check out also the associated Github repository for all the source code files and much more.